dgit.raspbian.org Git

EFI: add efi_arch_handle_cmdline() for processing commandline

Add arch function for processing the Xen commandline and
updating internal structures.

Signed-off-by: Roy Franz <roy.franz@linaro.org>
Acked-by: Jan Beulich <jbeulich@suse.com>

EFI: add efi_arch_cfg_file_early/late() to handle arch specific cfg file fields

Different architectures have some different configuration file
fields that need to be handled.  In particular, x86 has ucode
and ARM has device tree files to be loaded.  These arch specific
functions is used to allow each architecture to implement these
features in arch specific code.  Early/late versions are provided,
as ARM needs to process the DTB entry first, and x86 wants to process
the ucode entry last as it is the smallest allocation.

Signed-off-by: Roy Franz <roy.franz@linaro.org>
Acked-by: Jan Beulich <jbeulich@suse.com>

EFI: add architecture functions for pre/post ExitBootServices

The UEFI ExitBootServices function is invoked to transition the
system to the 'runtime' mode of operation, and is done right before
transitioning from the EFI loader code into Xen proper. x86 does some
arch specific memory management (trampoline) before exit boot services,
and the code that transitions from the EFI application state to Xen
is architecture specific. This patch adds two functions, one pre
and one post ExitBootServices to allow each architecture to
to handle these cases in a customized manner.

Signed-off-by: Roy Franz <roy.franz@linaro.org>
Acked-by: Jan Beulich <jbeulich@suse.com>

EFI: create arch functions to allocate memory for and process memory map

The memory used to store the EFI memory map is allocated in an architecture
specific way, and the processing of the memory map itself uses x86 specific
data structures. This patch adds architecture specific funtions so each
architecture can provide its own implementation.

Signed-off-by: Roy Franz <roy.franz@linaro.org>
Acked-by: Jan Beulich <jbeulich@suse.com>

EFI: move x86 specific functions/variables to arch header

Move the global variables and functions that can be moved as-is
from the common boot.c file to the x86 implementation header file.

Signed-off-by: Roy Franz <roy.franz@linaro.org>
Acked-by: Jan Beulich <jbeulich@suse.com>

EFI: move x86 boot/runtime code to common/efi

This moves the EFI boot and runtime services code to the common/efi directory.
This code is symbolicly linked back into the arch/x86/efi directory where it is
built if a build-time check for PE/COFF support in the toolchain passes.  In
the PE/COFF supporting case, both the EFI executable and the normal Xen image
(with stubbed EFI functions) are built.  We can't use the normal common build
infrastructure since we are building two versions at the same time, with
different EFI related code in each.  No code changes, just file movement and
make updates.  The files are symbolicly linked at build time back toe the
original arch/x86/efi directory.  This is in preparation for adding ARM EFI
support where much of these files can be shared.

Signed-off-by: Roy Franz <roy.franz@linaro.org>
Acked-by: Jan Beulich <jbeulich@suse.com>

x86/vlapic: don't silently accept bad vectors

Vectors 0-15 are reserved, and a physical LAPIC - upon sending or
receiving one - would generate an APIC error instead of doing the
requested action. Make our emulation behave similarly.

Signed-off-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
Acked-by: Tim Deegan <tim@xen.org>

x86/vlapic: a few type adjustments

Constify a couple of pointer parameters, convert a boolean function
return type to bool_t, and clean up a printk() being touched anyway.

Signed-off-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
Acked-by: Tim Deegan <tim@xen.org>

x86/HVM: fix ID handling of x2APIC emulation

- properly change ID when switching into x2APIC mode (instead of
  mimicking necessary behavior in hvm_x2apic_msr_read())
- correctly (meaningfully) set LDR (so far it ended up being 1 on all
  vCPU-s)
- even if we don't support more than 128 vCPU-s in a HVM guest for now,
  we should properly handle IDs as 32-bit values (i.e. not ignore the
  top 24 bits)
- with that, properly do cluster ID and bit mask check in
  vlapic_match_logical_addr()

Signed-off-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
Acked-by: Tim Deegan <tim@xen.org>

x86/HVM: fix miscellaneous aspects of x2APIC emulation

- generate #GP on invalid APIC base MSR transitions
- fail reads from the EOI and self-IPI registers (which are write-only)
- handle self-IPI writes and the ICR2 half of ICR writes largely in
  hvm_x2apic_msr_write() and (for self-IPI only) vlapic_apicv_write()
- don't permit MMIO-based access in x2APIC mode
- filter writes to read-only registers in hvm_x2apic_msr_write(),
  allowing conditionals to be dropped from vlapic_reg_write()
- don't ignore upper half of MSR-based write to ESR being non-zero
- don't ignore other writes to reserved bits
- VMX's EXIT_REASON_APIC_WRITE must not result in #GP (this exit being
  trap-like, this exception would get raised on the wrong RIP)
- make hvm_x2apic_msr_read() produce X86EMUL_* return codes just like
  hvm_x2apic_msr_write() does (benign to the only caller)

Signed-off-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
Acked-by: Tim Deegan <tim@xen.org>

x86: use constant as multiboot protocol identifier

... instead of plain number.

Signed-off-by: Daniel Kiper <daniel.kiper@oracle.com>
Reviewed-by: Andrew Cooper <andrew.cooper@citrix.com>

x86: define e820 entries counter as unsigned int

e820 entries counter is inherently an unsigned quantity
so define it as unsigned int.

Signed-off-by: Daniel Kiper <daniel.kiper@oracle.com>
Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>

x86/hvm: Forced Emulation Prefix for debug builds of Xen

Analysis of XSAs 105 and 106 show that is possible to force a race condition
which causes any arbitrary instruction to be emulated.

To aid testing, explicitly introduce the Forced Emulation Prefix for debug
builds alone.

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>

misc/coverity: Model __builtin_unreachable()

This resolves 23 issues Coverity had identified by following the false path of
an ASSERT().

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>

x86/LAPIC: drop support for non-integrated APIC

We never really supported such, even in the 32-bit days.

As a minor extra thing move the APIC_SELF_IPI definition out of the
middle of Divider Configuration Register ones.

Signed-off-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>

x86: make dump_pageframe_info() slightly more verbose for dying domains

Allowing more than just 10 pages to be printed in this case gives a
better chance to fully understand eventual page reference leaks: Report
up to 16 "normal" (writable or untyped) pages, and an unlimited number
of special type (page or descriptor table) ones.

Signed-off-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>

x86emul: fix SYSCALL/SYSENTER/SYSEXIT emulation

SYSCALL:
- make sure SS selector has RPL 0
- only use 32 bits of RIP to fill RCX when target execution mode is 32-bit
- don't shadow function wide variable 'rc'
- consolidate CS attribute setting into single statements
- drop pointless initializers and casts
- drop redundant MSR_STAR read (as suggested by Andrew Cooper)

SYSENTER/SYSEXIT:
- #GP condition doesn't depend on guest mode
- only use 32 bits for setting RIP/RSP when target execution mode is 32-bit
- don't shadow function wide variable 'rc'
- consolidate CS attribute setting into single statements
- drop pointless (and inconsistently used) casts

Signed-off-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>

x86/p2m: typo fix for spelling ambiguous

Signed-off-by: Tamas K Lengyel <tklengyel@sec.in.tum.de>
Acked-by: Tim Deegan <tim@xen.org>

Merge branch 'staging' of ssh://xenbits.xen.org/home/xen/git/xen into staging

x86/EFI: fix freeing of uninitialized pointer

The only valid response from the LocateHandle() call is EFI_BUFFER_TOO_SMALL,
so exit if we get anything else.  We pass a 0 size/NULL pointer buffer, so the
only other returns we will get is an error.  Return right away as there is
nothing to do.  Also return if there is an error allocating the buffer, as the
previous code path also allowed for an undefined pointer to be freed.

Signed-off-by: Roy Franz <roy.franz@linaro.org>
Re-structure the change.

Signed-off-by: Jan Beulich <jbeulich@suse.com>

flask/policy: use naming convention xenpolicy-$(XEN_FULLVERSION)

Daniel suggested we use xenpolicy-$(XEN_FULLVERSION) as flask policy
naming convention.

Signed-off-by: Wei Liu <wei.liu2@citrix.com>
Cc: Daniel De Graaf <dgdegra@tycho.nsa.gov>
Cc: Ian Campbell <ian.campbell@citrix.com>
Acked-by: Daniel De Graaf <dgdegra@tycho.nsa.gov>

Fix QEMU cross-compile build

Introduce the per-arch IOEMU_CPU_ARCH variable.
Always pass --configure=IOEMU_CPU_ARCH to QEMU's configure script.

Signed-off-by: Stefano Stabellini <stefano.stabellini@eu.citrix.com>
Acked-by: Ian Campbell <ian.campbell@citrix.com>
[ ijc -- dropped redundant comments ]

MAINTAINERS: Add Wei Liu as toolstack co-maintainer.

The three existing maintainers are not really able to keep up with
the flow and Wei is one of the top tools contributors (according to
"git shortlog -s -n -p RELEASE-4.4.0..origin/staging tools" and my
own impressions).

Signed-off-by: Ian Campbell <ian.campbell@citrix.com>
Cc: Wei Liu <wei.liu2@citrix.com>
Acked-by: Stefano Stabellini <stefano.stabellini@eu.citrix.com>
Acked-by: Wei Liu <wei.liu2@citrix.com>

docs: add PVH specification

Introduce a document that describes the interfaces used on PVH. This
document has been designed from a guest OS point of view (i.e.: what a guest
needs to do in order to support PVH).

Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
Acked-by: David Vrabel <david.vrabel@citrix.com>
Acked-by: Mukesh Rathor <mukesh.rathor@oracle.com>
Reviewed-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Cc: Jan Beulich <JBeulich@suse.com>
Cc: Mukesh Rathor <mukesh.rathor@oracle.com>
Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Cc: David Vrabel <david.vrabel@citrix.com>

xen/arm: remove check for generic timer support for arm64

Information about support for generic support is available in
IDR_PFR1 register in ARMv7. Where as this information is not
available in ARMv8 that supports only aarch64 bit mode.
ARMv8 being always supports generic timer, this check is not
required.

For platforms that support only aarch64 mode, IDR_PFR1 is
not implemented

Signed-off-by: Vijaya Kumar K <Vijaya.Kumar@caviumnetworks.com>
Acked-by: Ian Campbell <ian.campbell@citrix.com>

xen/arm: Restricted access to IFSR32_EL2 and FPEXC32_EL2

IFSR32_EL1 and FPEXC32_EL1 registers are accessible in
aarch64 mode only if aarch32 mode is support in EL1.
So allow access to these registers only for 32-bit domains.

Signed-off-by: Vijaya Kumar K <Vijaya.Kumar@caviumnetworks.com>
Acked-by: Ian Campbell <ian.campbell@citrix.com>

xen/arm: Use REG_RANK_INDEX macro

Use REG_RANK_INDEX macro to compute index to access
vgic ipriority[] and itargets[] for a given irq.

Signed-off-by: Vijaya Kumar K <Vijaya.Kumar@caviumnetworks.com>
Acked-by: Stefano Stabellini <stefano.stabellini@eu.citrix.com>
Acked-by: Ian Campbell <ian.campbell@citrix.com>

libxl: Remove a duplicate calculation of be_path

Coverity-ID: 1238177
CC: Wei Liu <wei.liu2@citrix.com>
Signed-off-by: Ian Jackson <Ian.Jackson@eu.citrix.com>
Reviewed-by: Don Slutz <dslutz@verizon.com>
Acked-by: Ian Campbell <ian.campbell@citrix.com>

make: Make "src-tarball" target actually make a source tarball

At the moment, making a release tarball is an annoyingly manual
process that involves running "git archive" into a temporary directory.

Script this process up and make a target, so that the release manager
can simply type "make src-tarball-release" and have everything show up
nice and neat in dist/xen-$version.tar.gz. "make src-tarball" will
make a version number based on git describe, which will typically have
the most recent tag, number of commits since that tag, and the git
commit id of the current HEAD.

Signed-off-by: George Dunlap <george.dunlap@eu.citrix.com>

make: Add subtree-force-update target

subtree-force-update will update all subtrees according to the current TAG specified
in Config.mk.

Signed-off-by: George Dunlap <george.dunlap@eu.citrix.com>
Signed-off-by: Ian Jackson <ian.jackson@eu.citrix.com>

xen: arm: Add support for the Exynos secure firmware

The existence of secure firmware is dictated by the presence of
"samsung,secure-firmware" in the DT.

The Arndale board does not have that entry, and uses the address as defined
in "samsung,exynos4210-sysram", offset 0 as the smp init address. This is
possibly true for all SoCs without secure firmware.

For other boards which do have a "secure-firmware" node, use sysram-ns
at offset +0x1c as the smp init address.

The "secure-firmware" MMIO range contains ways to idle the CPU. As this gets
mapped to DOM0 because of its presence in the DT, we blacklist it.

Have tested this on the Odroid XU. I have also tested the other code path
on the Odroid XU by removing "secure-firmware" from its DT. I could see
that the other code path was exercised with correct smp init address
values.

Signed-off-by: Suriyan Ramasami <suriyan.r@gmail.com>
Tested-by: Ian Campbell <ian.campbell@citrix.com>
Acked-by: Ian Campbell <ian.campbell@citrix.com>

x86/viridian: Add partition time reference counter MSR support

This patch optionally re-instates support for the partition time reference
counter that was previously introduced by commit
e36cd2cdc9674a7a4855d21fb7b3e6e17c4bb33b and reverted by commit
1cd4fab14ce25859efa4a2af13475e6650a5506c. The previous implementation was
non-optional and flawed.

This implementation uses the tsc of vcpu0, which is preserved across
save/restore as part of the architectural state, and then converts that
to a 100ns tick using the domain's tsc_khz.

Signed-off-by: Paul Durrant <paul.durrant@citrix.com>
Cc: Keir Fraser <keir@xen.org>
Acked-by: Jan Beulich <jbeulich@suse.com>
Cc: Ian Campbell <ian.campbell@citrix.com>
Cc: Ian Jackson <ian.jackson@eu.citrix.com>
Cc: Stefano Stabellini <stefano.stabellini@eu.citrix.com>
Cc: Christoph Egger <chegger@amazon.de>
Acked-by: Ian Campbell <ian.campbell@citrix.com>

x86/viridian: Re-purpose the HVM parameter to be a feature mask

The following commits introduced the time reference counter MSR and
TSC/APIC frequency MSRs into the viridian feature set respectively:

e36cd2cdc9674a7a4855d21fb7b3e6e17c4bb33b
84657efd9116f40924aa13c9d5a349e007da716f

The time reference counter MSR feature was then reverted by commit

1cd4fab14ce25859efa4a2af13475e6650a5506c

because a flaw in the implementation meant the counter was reset on
migration.

All of these changes were made without any addtional options being
added to the VM configuration, or any compatibility checks being made
in the domain save/restore code. Hence setting the single boolean
'viridian' option in the VM configuration yields a different set of
features depending on which version of Xen the VM is started on, and the
feature set can change across migration (so new MSRs can magically appear).

This patch grandfathers in the current viridian features set and calls them
the 'base' and 'freq' feature sets. HVM_PARAM_VIRIDIAN is re-purposed as
a feature mask. The hypervisor has only ever allowed it ot be set to 0
or 1, so the presence of the base and freq sets are indicated by setting
bit 0. The freq set can then be turned off by setting bit 1, thus
restoring the pre-Xen-4.4 base set. Newly implemented viridian features
can be optionally enabled in future by setting further bits.

The viridian option in xl.cfg(5) has also been changed to a list so
that the sets can be individually enabled or disabled. For compatibility,
if the option is specified as a boolean, then a true (1) value will enable
the base and freq sets and a false (0) value will not enable any
enlightenments.

This patch also alters the allowed write accesses to HVM_PARAM_VIRIDIAN.
Currently there is nothing to stop the guest writing this value (which,
while harmless to anything else, should not happen) and nothing to
stop a toolstack from setting the value back to zero whilst the guest is
running, causing CPUID leaves to disappear and MSR accesses to start
causing GPFs in the guest. Both of these possibilities are now disallowed:
Once the parameter is set to a non-zero value it may not be modified (only
re-written with the same value), and guests no longer have any write
access.

Signed-off-by: Paul Durrant <paul.durrant@citrix.com>
Cc: Keir Fraser <keir@xen.org>
Acked-by: Jan Beulich <jbeulich@suse.com>
Cc: Ian Campbell <ian.campbell@citrix.com>
Cc: Ian Jackson <ian.jackson@eu.citrix.com>
Cc: Stefano Stabellini <stefano.stabellini@eu.citrix.com>
Cc: David Scott <dave.scott@eu.citrix.com>
Acked-by: Ian Campbell <ian.campbell@citrix.com>

xl: introduce rtds scheduler

Add xl command for rtds scheduler
Note: VCPU's parameter (period, budget) is in microsecond (us).

Signed-off-by: Meng Xu <mengxu@cis.upenn.edu>
Signed-off-by: Sisu Xi <xisisu@gmail.com>
Reviewed-by: Dario Faggioli <dario.faggioli@citrix.com>
Reviewed-by: George Dunlap <george.dunlap@eu.citrix.com>
Acked-by: Ian Campbell <ian.campbell@citrix.com>

libxl: add rtds scheduler

Add libxl functions to set/get domain's parameters for rtds scheduler
Note: VCPU's information (period, budget) is in microsecond (us).

Signed-off-by: Meng Xu <mengxu@cis.upenn.edu>
Signed-off-by: Sisu Xi <xisisu@gmail.com>
Reviewed-by: Dario Faggioli <dario.faggioli@citrix.com>
Reviewed-by: George Dunlap <george.dunlap@eu.citrix.com>
Acked-by: Ian Campbell <ian.campbell@citrix.com>

libxc: add rtds scheduler

Add xc_sched_rtds_* functions to interact with Xen to set/get domain's
parameters for rtds scheduler.
Note: VCPU's information (period, budget) is in microsecond (us).

Signed-off-by: Meng Xu <mengxu@cis.upenn.edu>
Signed-off-by: Sisu Xi <xisisu@gmail.com>
Reviewed-by: Dario Faggioli <dario.faggioli@citrix.com>
Acked-by: Ian Campbell <ian.campbell@citrix.com>
Acked-by: George Dunlap <george.dunlap@eu.citrix.com>
[ ijc -- xenctrl.h has moved to tools/libxc/include, adjust patch to match ]

xen: add real time scheduler rtds

This scheduler follows the Preemptive Global Earliest Deadline First
(EDF) theory in real-time field.
At any scheduling point, the VCPU with earlier deadline has higher
priority. The scheduler always picks the highest priority VCPU to run on a
feasible PCPU.
A PCPU is feasible if the VCPU can run on this PCPU and (the PCPU is
idle or has a lower-priority VCPU running on it.)

Each VCPU has a dedicated period and budget.
The deadline of a VCPU is at the end of each period;
A VCPU has its budget replenished at the beginning of each period;
While scheduled, a VCPU burns its budget.
The VCPU needs to finish its budget before its deadline in each period;
The VCPU discards its unused budget at the end of each period.
If a VCPU runs out of budget in a period, it has to wait until next period.

Each VCPU is implemented as a deferable server.
When a VCPU has a task running on it, its budget is continuously burned;
When a VCPU has no task but with budget left, its budget is preserved.

Queue scheme:
A global runqueue and a global depletedq for each CPU pool.
The runqueue holds all runnable VCPUs with budget and sorted by deadline;
The depletedq holds all VCPUs without budget and unsorted.

Note: cpumask and cpupool is supported.

This is an experimental scheduler.

Signed-off-by: Meng Xu <mengxu@cis.upenn.edu>
Signed-off-by: Sisu Xi <xisisu@gmail.com>
Acked-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Dario Faggioli <dario.faggioli@citrix.com>
Tested-by: Dario Faggioli <dario.faggioli@citrix.com>
Reviewed-by: George Dunlap <george.dunlap@eu.citrix.com>
[ ijc -- use PRI_stime to print delta in burn_budget, to fix build on
32-bit (i.e. arm32) ]

tools: enable QEMU for ARM builds

Build qemu-xen on ARM and ARM64: it is used to provide the PV backends,
disk and framebuffer in particular.

Ideally we would also modify the configure options to only build what is
necessary: a machine just for PV backends. However that is a work in
progress and not yet available in QEMU (see
http://marc.info/?l=qemu-devel&m=139082425718379&w=2). So we just build
the usual i386 target, even though no i386 emulation is going to be done
by qemu-xen on ARM.

Signed-off-by: Stefano Stabellini <stefano.stabellini@eu.citrix.com>

Move xenstore and libxc public headers to include subdir

Also moves xc_dom.h to include as it is used often by other xen tools.
Use the new include subdirectories to build Xen tools, qemu-xen and
stubdoms.

Add the old libxc include path to the programs that need it to build,
on a case by case basis and commeting that they shouldn't require
internal libxc headers to build.

[ And: update QEMU_TRADITIONAL_REVISION to corresponding qemu patch
- Ian jackson ]

Signed-off-by: Stefano Stabellini <stefano.stabellini@eu.citrix.com>

Rerun autogen.sh after 7d7147762282 "Use configure --sysconfdir=DIR to se..."

I tried to do this but failed to commit --amend correctly before pushing.

Signed-off-by: Ian Campbell <ian.campbell@citrix.com>

x86emul: only emulate software interrupt injection for real mode

Protected mode emulation currently lacks proper privilege checking of
the referenced IDT entry, and there's currently no legitimate way for
any of the respective instructions to reach the emulator when the guest
is in protected mode.

This is XSA-106.

Reported-by: Andrei LUTAS <vlutas@bitdefender.com>
Signed-off-by: Jan Beulich <jbeulich@suse.com>
Acked-by: Keir Fraser <keir@xen.org>

x86/emulate: check cpl for all privileged instructions

Without this, it is possible for userspace to load its own IDT or GDT.

This is XSA-105.

Reported-by: Andrei LUTAS <vlutas@bitdefender.com>
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Tested-by: Andrei LUTAS <vlutas@bitdefender.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>

x86/shadow: fix race condition sampling the dirty vram state

d->arch.hvm_domain.dirty_vram must be read with the domain's paging lock held.

If not, two concurrent hypercalls could both end up attempting to free
dirty_vram (the second of which will free a wild pointer), or both end up
allocating a new dirty_vram structure (the first of which will be leaked).

This is XSA-104.

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Tim Deegan <tim@xen.org>

Use configure --sysconfdir=DIR to set CONFIG_DIR

Preserve existing behaviour: if the option was not given, set existing
defaults for FreeBSD, Solaris and everything else.

Signed-off-by: Olaf Hering <olaf@aepfle.de>

tools/libxc: use XEN_RUN_DIR for SUSPEND_LOCK_FILE

Remove hardcoded /var/run/xen directory path, use XEN_RUN_DIR instead.

Signed-off-by: Olaf Hering <olaf@aepfle.de>
Acked-by: Ian Campbell <ian.campbell@citrix.com>

tools/libxc: provide variable paths to libxc

In preparation to remove hardcoded /var/run/xen paths, provide
XEN_RUN_DIR and related directories to xc_private.h. Similar code exists
already for libxl, stubdom and other parts.

Signed-off-by: Olaf Hering <olaf@aepfle.de>
Acked-by: Ian Campbell <ian.campbell@citrix.com>

tools/libxl: use buildmakevars2header to create _paths.h

Replace usage of buildmakevars2file with buildmakevars2header. The macro
generates a C header file, so remove code which converts shell variables
into C defines. Also update the dependency, the macro itself creates a
dependency for _paths.h. A temporary file is not needed anymore.

Signed-off-by: Olaf Hering <olaf@aepfle.de>
Acked-by: Ian Campbell <ian.campbell@citrix.com>

Config.mk: add new macro buildmakevars2header

This macro is similar to buildmakevars2file, it just creates a C header
file instead of shell style syntax. Upcoming changes will use this macro
in libxl and libxc.

Signed-off-by: Olaf Hering <olaf@aepfle.de>
Acked-by: Ian Campbell <ian.campbell@citrix.com>

Config.mk: replace dependency to genpath with actual target

genpath is a detail of buildmakevars2file. Replace the dependency to
genpath with the actual buildmakevars2file target. This change by
itself does not fix any bug. Upcoming changes will add dependencies to
$(target), but no rule exist to create $(target).

To force a rebuild of the $(1) rule the target now depends on the
existing .phony target. This dummy target is already used elsewhere in
the code.

No change in behaviour is expected by this patch.

Signed-off-by: Olaf Hering <olaf@aepfle.de>
Acked-by: Ian Campbell <ian.campbell@citrix.com>

Config.mk: move directory list into BUILD_MAKE_VARS

To maintain the list of directories in a single place, move the existing
list into its own variable and use it in buildmakevars2file.
Required for upcoming changes.
Trim also whitespaces.

Signed-off-by: Olaf Hering <olaf@aepfle.de>
Acked-by: Ian Campbell <ian.campbell@citrix.com>

remove obsolete SUBSYS_DIR variable

/var/run is a runtime directory. It is not supposed to be packaged.
Remove unused SUBSYS_DIR variable from Config.mk and distro_mapping.txt.

Signed-off-by: Olaf Hering <olaf@aepfle.de>
Acked-by: Ian Campbell <ian.campbell@citrix.com>

tools/examples: remove obsolete install targets

install-hotplug and install-udev are obsolete since commit 57bcfa11
("tools/hotplug: Separate OS-specific scripts.")

Signed-off-by: Olaf Hering <olaf@aepfle.de>
Acked-by: Ian Campbell <ian.campbell@citrix.com>

tools/hotplug: use XEN_LOCK_DIR instead of hardcoded path

Use XEN_LOCK_DIR because it is a compiletime setting.

Signed-off-by: Olaf Hering <olaf@aepfle.de>
Acked-by: Ian Campbell <ian.campbell@citrix.com>

tools/hotplug: create XEN_LOCK_DIR at runtime

Create XEN_LOCK_DIR because it is a compiletime setting. Also /var/lock
might be empty on startup because it is a tmpfs mount point.

Signed-off-by: Olaf Hering <olaf@aepfle.de>
Acked-by: Ian Campbell <ian.campbell@citrix.com>

tools/hotplug: create XEN_RUN_DIR at runtime

Create XEN_RUN_DIR instead of hardcoded path because it is a compiletime
setting. Also /var/run might be empty on startup because it is a tmpfs
mount point.

Signed-off-by: Olaf Hering <olaf@aepfle.de>
Acked-by: Ian Campbell <ian.campbell@citrix.com>

tools/pygrub: store kernels in /var/run/xen/pygrub

Move location of temporary bootfiles from /var/run/xend/boot to
/var/run/xen/pygrub. Create the subdirectory if does not exist.
The <dir> argument --output-directory must be an existing directory.

The reason for this change is that all entrys below /var/run have to be
created at runtime in case /var/run is cleared on every boot.

Signed-off-by: Olaf Hering <olaf@aepfle.de>
Acked-by: Ian Campbell <ian.campbell@citrix.com>

tools: remove obsolete path.py from tools/python

The directory tools/python/xen/util does not exist.
Upcoming changes to genpath will fail if the rule persists.
Nothing uses path.py (anymore?), so get rid it.

Signed-off-by: Olaf Hering <olaf@aepfle.de>
Acked-by: Ian Campbell <ian.campbell@citrix.com>
[ ijc -- removed from .gitignore too ]

tools/mkrpm: allow custom rpm package name

Even if xen is configured and compiled with different --prefix= so that
it operates entirely below $prefix, the resulting package from 'make
rpmball' is always called "xen.rpm".

Use an environment name to give a different name.
This can be used like this:

suffix=-bugN
prefix=/opx/xen/staging${suffix}
./configure --prefix=${prefix}
make rpmball PKG_SUFFIX=${suffix}

The result will be "xen-bugN.rpm" instead of "xen.rpm". The benefit is that
many xen${suffix}.rpm packages can be installed at the same time.

Signed-off-by: Olaf Hering <olaf@aepfle.de>
Acked-by: Ian Campbell <ian.campbell@citrix.com>

install.sh: Preserve permissions from make install

Signed-off-by: Olaf Hering <olaf@aepfle.de>
Acked-by: Ian Campbell <ian.campbell@citrix.com>

tools/xenpaging: create dumpdir with mode 0700

The swapfile contain sensitive guest info.

Signed-off-by: Olaf Hering <olaf@aepfle.de>
Acked-by: Ian Campbell <ian.campbell@citrix.com>

stubdom: fix lwip compile

stubdom/lwip-x86_64/src/core/dhcp.c: In function 'dhcp_create_request':
stubdom/lwip-x86_64/src/core/dhcp.c:1359:71: error: array subscript is above array bounds [-Werror=array-bounds]
dhcp->msg_out->chaddr[i] = (i < netif->hwaddr_len) ? netif->hwaddr[i] : 0/* pad byte*/;

gcc can not know if hwaddr_len exceeds the hwaddr array size,
so force an upper limit to assist gcc.

Signed-off-by: Olaf Hering <olaf@aepfle.de>
Acked-by: Samuel Thibault <samuel.thibault@ens-lyon.org>

xen: arm: Enable physical address space compression (PDX) on arm

This allows us to support sparse physical address maps which we previously
could not because the frametable would end up taking up an enormous fraction
of RAM.

On a fast model which has RAM at 0x80000000-0x100000000 and
0x880000000-0x900000000 this reduces the size of the frametable from
478M to 84M.

Signed-off-by: Ian Campbell <ian.campbell@citrix.com>
Reviewed-by: Julien Grall <julien.grall@linaro.org>

xen: add helpers for PDX mask initialisation calculations

I wanted to make fill_mask a public function so I could use it on ARM, but it
was actually easier to think of a (semi) reasonable public name for the users
of it, so that is what I have done.

Signed-off-by: Ian Campbell <ian.campbell@citrix.com>
Acked-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Julien Grall <julien.grall@linaro.org>

xen: refactor physical address space compression support into common code

The "pdx compression" functionality will be useful on ARM as well.

Move the code to common code+header and introduce HAS_PDX to control when it is
built. L2_PAGETABLE_SHIFT is x86 specific, so introduce PDX_GROUP_SHIFT to
abstract it out.

ARM has no need for superpage compression (yet?) and lacks SUPERPAGE_SHIFT so
those functions (spage_to_mfn et al) are not moved.

No affect on x86 and no change for ARM (yet).

Signed-off-by: Ian Campbell <ian.campbell@citrix.com>
Acked-by: Jan Beulich <jbeulich@suse.com>

xen: arm: support for up to 48-bit IPA addressing on arm64

Currently we support only 40-bits. This is insufficient on systems where
peripherals which need to be 1:1 mapped to dom0 are above the 40-bit limit.

Unfortunately the hardware requirements are such that this means that the
number of levels in the P2M is not static and must vary with the number of
implemented physical address bits. This is described in ARM DDI 0487A.b Table
D4-5. In short there is no single p2m configuration which supports everything
from 40- to 48- bits.

For example a system which supports up to 40-bit addressing will only support 3
level p2m (maximum SL0 is 1 == 3 levels), requiring a concatenated page table
root consisting of two pages to make the full 40-bits of addressing.

A maximum of 16 pages can be concatenated meaning that a 3 level p2m can only
support up to 43-bit addresses. Therefore support for 48-bit addressing
requires SL0==2 (4 levels of paging).

After the previous patches our various p2m lookup and manipulation functions
already support starting at arbitrary level and with arbitrary root
concatenation. All that remains is to determine the correct settings from
ID_AA64MMFR0_EL1.PARange for which we use a lookup table.

As well as supporting 44 and 48 bit addressing we can also reduce the order of
the first level for systems which support only 32 or 36 physical address bits,
saving a page.

Systems with 42-bits are an interesting case, since they only support 3 levels
of paging, implying that 8 pages are required at the root level. So far I am
not aware of any systems with peripheral located so high up (the only 42-bit
system I've seen has nothing above 40-bits), so such systems remain configured
for 40-bit IPA with a pair of pages at the root of the p2m.

Switching to symbolic names for the VTCR_EL2 bits as we go improves the clarity
of the result.

Parts of this are derived from "xen/arm: Add 4-level page table for
stage 2 translation" by Vijaya Kumar K.

arm32 remains with the static 3-level, 2 page root configuration.

Signed-off-by: Ian Campbell <ian.campbell@citrix.com>
Reviewed-by: Julien Grall <julien.grall@linaro.org>

xen: arm: support for up to 48-bit physical addressing on arm64

This only affects Xen's own stage one paging.

- Use symbolic names for TCR bits for clarity.
- Update PADDR_BITS
- Base field of LPAE PT structs is now 36 bits (and therefore
unsigned long long for arm32 compatibility)
- TCR_EL2.PS is set from ID_AA64MMFR0_EL1.PASize.
- Provide decode of ID_AA64MMFR0_EL1 in CPU info

Parts of this are derived from "xen/arm: Add 4-level page table for
stage 2 translation" by Vijaya Kumar K.

Signed-off-by: Ian Campbell <ian.campbell@citrix.com>
Reviewed-by: Julien Grall <julien.grall@linaro.org>

xen: arm: handle variable p2m levels in apply_p2m_changes

As with previous changes this involves conversion from a linear series of
lookups into a loop over the levels.

Signed-off-by: Ian Campbell <ian.campbell@citrix.com>
Cc: Arianna Avanzini <avanzini.arianna@gmail.com>
Reviewed-by: Julien Grall <julien.grall@linaro.org>

xen: arm: handle variable p2m levels in p2m_lookup

This paves the way for boot-time selection of the number of levels to
use in the p2m, which is required to support both 40-bit and 48-bit
systems. For now the starting level remains a compile time constant.

Implemented by turning the linear sequence of lookups into a loop.

Signed-off-by: Ian Campbell <ian.campbell@citrix.com>
Reviewed-by: Julien Grall <julien.grall@linaro.org>

xen: arm: Defer setting of VTCR_EL2 until after CPUs are up

Currently we retain the hardcoded values but soon we will want to calculate the
correct values based upon the CPU properties common to all processors, which
are only available once they are all up.

Signed-off-by: Ian Campbell <ian.campbell@citrix.com>
Reviewed-by: Julien Grall <julien.grall@linaro.org>

xen: arm: move setup_virt_paging to p2m.[ch] from mm.[ch]

This file is where most of the P2M logic lives and this function will
eventually need to poke at some internals, so move it.

This is pure code motion.

Signed-off-by: Ian Campbell <ian.campbell@citrix.com>
Acked-by: Julien Grall <julien.grall@linaro.org>

xen: arm: handle concatenated root tables in dump_pt_walk

ARM allows for the concatenation of pages at the root of a p2m (but not a
regular page table) in order to support a larger IPA space than the number of
levels in the P2M would normally support. We use this to support 40-bit guest
addresses.

Previously we were unable to dump IPAs which were outside the first page of the
root. To fix this we adjust dump_pt_walk to take the machine address of the
page table root instead of expecting the caller to have mapped it. This allows
the walker code to select the correct page to map.

Signed-off-by: Ian Campbell <ian.campbell@citrix.com>
Reviewed-by: Julien Grall <julien.grall@linaro.org>

xen: arm: Implement variable levels in dump_pt_walk

This allows us to correctly dump 64-bit hypervisor addresses, which use a 4
level table.

It also paves the way for boot-time selection of the number of levels to use in
the p2m, which is required to support both 40-bit and 48-bit systems.

To support multiple levels it is convenient to recast the page table walk as a
loop over the levels instead of the current open coding.

Signed-off-by: Ian Campbell <ian.campbell@citrix.com>
Reviewed-by: Julien Grall <julien.grall@linaro.org>

xen: arm: rename p2m->first_level to p2m->root.

This was previously part of Vijaya's "xen/arm: Add 4-level page table
for stage 2 translation" but is split out here to make that patch
easier to read.

I went with ->root rather than ->root_level as the original did.

Signed-off-by: Ian Campbell <ian.campbell@citrix.com>
Reviewed-by: Julien Grall <julien.grall@linaro.org>
Cc: Vijaya Kumar K <Vijaya.Kumar@caviumnetworks.com>

tools: libxl: read nictype from xenstore

We need to use nictype to get default vifname.

Signed-off-by: Wen Congyang <wency@cn.fujitsu.com>
Acked-by: Ian Campbell <ian.campbell@citrix.com>

tools: libxl: pass correct file to qemu if we use blktap2

If we use blktap2, the correct file should be blktap device
not the pdev_path.

Signed-off-by: Wen Congyang <wency@cn.fujitsu.com>
Cc: Shriram Rajagopalan <rshriram@cs.ubc.ca>
Acked-by: Ian Campbell <ian.campbell@citrix.com>

tools: libxc: restore: csum the correct page

In verify mode, we map the guest memory, and the guest page is
region_base + i * PAGE_SIZE. So we should csum page (region_base
+ i * PAGE_SIZE), not (region_base + (i+curbatch) * PAGE_SIZE)

Signed-off-by: Wen Congyang <wency@cn.fujitsu.com>
Acked-by: Ian Campbell <ian.campbell@citrix.com>

tools: libxc: restore: copy the correct page to memory

apply_batch() only handles MAX_BATCH_SIZE pages at one time. If
there is some bogus/unmapped/allocate-only/broken page, we will
skip it. So when we call apply_batch() again, the first page's
index is curbatch - invalid_pages. invalid_pages stores the number
of bogus/unmapped/allocate-only/broken pages we have found.

In many cases, invalid_pages is 0, so we don't catch this error.

Signed-off-by: Hong Tao <bobby.hong@huawei.com>
Signed-off-by: Wen Congyang <wency@cn.fujitsu.com>
Acked-by: Ian Campbell <ian.campbell@citrix.com>

Update libfdt to v1.4.0

Update libfdt to v1.4.0 of libfdt taken from git://git.jdl.com/software/dtc.git
Xen changes to libfdt_env.h carried over from existing libfdt (v1.3.0)
This update provides the fdt_create_empty_tree() function used by the ARM
EFI boot code.

Signed-off-by: Roy Franz <roy.franz@linaro.org>
Acked-by: Ian Campbell <ian.campbell@citrix.com>

add arm64 cache flushing code from linux v3.16

__flush_dcache_all added from arch/arm64/mm/cache.S, with helper macros from
arch/arm64/include/asm/assembler.h, from v3.16. The cache flushing is required
when transitioning from EFI code that runs with cache enable to Xen startup
code which expects the cache to be disabled.

Signed-off-by: Roy Franz <roy.franz@linaro.org>
Acked-by: Ian Campbell <ian.campbell@citrix.com>
[ ijc -- removed indent on ENTRY() and dropped the entry point label which
duplicates the one from the macro. ]

VT-d: suppress UR signaling for further desktop chipsets

This extends commit d6cb14b34f ("VT-d: suppress UR signaling for
desktop chipsets") as per the finally obtained list of affected
chipsets from Intel.

Also pad the IDs we had listed there before to full 4 hex digits.

This is CVE-2013-3495 / XSA-59.

Signed-off-by: Jan Beulich <jbeulich@suse.com>
Acked-by: Yang Zhang <yang.z.zhang@intel.com>

x86: handle resumed instruction based on previous mem_event reply

In a scenario where a page fault that triggered a mem_event occured,
p2m_mem_access_check() will now be able to either 1) emulate the
current instruction, or 2) emulate it, but don't allow it to perform
any writes.

Signed-off-by: Razvan Cojocaru <rcojocaru@bitdefender.com>
Acked-by: Tim Deegan <tim@xen.org>
Acked-by: Jan Beulich <jbeulich@suse.com>

x86, libxc: force-enable relevant MSR events

Vmx_disable_intercept_for_msr() will now refuse to disable interception of
MSRs needed for memory introspection. It is not possible to gate this on
mem_access being active for the domain, since by the time mem_access does
become active the interception for the interesting MSRs has already been
disabled (vmx_disable_intercept_for_msr() runs very early on).

Signed-off-by: Razvan Cojocaru <rcojocaru@bitdefender.com>
Acked-by: Jan Beulich <jbeulich@suse.com>
Acked-by: Ian Campbell <ian.campbell@citrix.com>
Acked-by: Kevin Tian <kevin.tian@intel.com>
Acked-by: Tim Deegan <tim@xen.org>

x86: optimize introspection access to guest state

Speed optimization for introspection purposes: a handful of registers
are sent along with each mem_event. This requires enlargement of the
mem_event_request / mem_event_response stuctures, and additional code
to fill in relevant values. Since the EPT event processing code needs
more data than CR3 or MSR event processors, hvm_mem_event_fill_regs()
fills in less data than p2m_mem_event_fill_regs(), in order to avoid
overhead. Struct hvm_hw_cpu has been considered instead of the custom
struct mem_event_regs_st, but its size would cause quick filling up
of the mem_event ring buffer.

Signed-off-by: Razvan Cojocaru <rcojocaru@bitdefender.com>
Acked-by: Jan Beulich <jbeulich@suse.com>
Acked-by: Tim Deegan <tim@xen.org>

x86/HVM: emulate with no writes

Added support for emulating an instruction with no memory writes.
Additionally, introduced hvm_emulate_one_full(), which inspects
possible return values from the hvm_emulate_one() functions
(EXCEPTION, UNHANDLEABLE) and acts on them.

Signed-off-by: Razvan Cojocaru <rcojocaru@bitdefender.com>
Acked-by: Jan Beulich <jbeulich@suse.com>

x86/HVM: batch vCPU wakeups

Mass wakeups (via vlapic_ipi()) can take enormous amounts of time,
especially when many of the remote pCPU-s are in deep C-states. For
64-vCPU Windows Server 2012 R2 guests on Ivybridge hardware,
accumulated times of over 2ms were observed (average 1.1ms).
Considering that Windows broadcasts IPIs from its timer interrupt,
which at least at certain times can run at 1kHz, it is clear that this
can't result in good guest behavior. In fact, on said hardware guests
with significantly beyond 40 vCPU-s simply hung when e.g. ServerManager
gets started.

This isn't just helping to reduce the number of ICR writes when the
host APICs run in clustered mode, it also reduces them by suppressing
the sends altogether when - by the time
cpu_raise_softirq_batch_finish() is reached - the remote CPU already
managed to handle the softirq. Plus - when using MONITOR/MWAIT - the
update of softirq_pending(cpu), being on the monitored cache line -
should make the remote CPU wake up ahead of the ICR being sent,
allowing the wait-for-ICR-idle latencies to be reduced (perhaps to a
large part due to overlapping the wakeups of multiple CPUs).

With this alone (i.e. without the IPI avoidance patch in place),
average broadcast times for a 64-vCPU guest went down to a measured
maximum of 310us. With that other patch in place, improvements aren't
as clear anymore (short term averages only went down from 255us to
250us, which clearly is within the error range of the measurements),
but longer term an improvement of the averages is still visible.
Depending on hardware, long term maxima were observed to go down quite
a bit (on aforementioned hardware), while they were seen to go up
again on a (single core) Nehalem (where instead the improvement on the
average values was more visible).

Of course this necessarily increases the latencies for the remote
CPU wakeup at least slightly. To weigh between the effects, the
condition to enable batching in vlapic_ipi() may need further tuning.

Signed-off-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Tim Deegan <tim@xen.org>

x86: suppress event check IPI to MWAITing CPUs

Mass wakeups (via vlapic_ipi()) can take enormous amounts of time,
especially when many of the remote pCPU-s are in deep C-states. For
64-vCPU Windows Server 2012 R2 guests on Ivybridge hardware,
accumulated times of over 2ms were observed (average 1.1ms).
Considering that Windows broadcasts IPIs from its timer interrupt,
which at least at certain times can run at 1kHz, it is clear that this
can't result in good guest behavior. In fact, on said hardware guests
with significantly beyond 40 vCPU-s simply hung when e.g. ServerManager
gets started.

Recognizing that writes to softirq_pending() already have the effect of
waking remote CPUs from MWAITing (due to being co-located on the same
cache line with mwait_wakeup()), we can avoid sending IPIs to CPUs we
know are in a (deep) C-state entered via MWAIT.

With this, average broadcast times for a 64-vCPU guest went down to a
measured maximum of 255us (which is still quite a lot).

One aspect worth noting is that cpumask_raise_softirq() gets brought in
sync here with cpu_raise_softirq() in that now both don't attempt to
raise a self-IPI on the processing CPU.

Signed-off-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Tim Deegan <tim@xen.org>

x86/hvm: always set pending event injection when loading VMC[BS] state

In colo mode, secondary vm is running, so VM_ENTRY_INTR_INFO may
valid before restoring vmcs. If there is no pending event after
restoring vm, we should clear it.

Signed-off-by: Wen Congyang <wency@cn.fujitsu.com>
Also clear pending software exceptions.
Copy the fix to SVM as well.

Signed-off-by: Tim Deegan <tim@xen.org>
Acked-by: Kevin Tian <kevin.tian@intel.com>
Acked-by: Aravind Gopalakrishnan <Aravind.Gopalakrishnan@amd.com>

x86/p2m: fix conversion macro of p2m_access to XENMEM_access

Signed-off-by: Tamas K Lengyel <tklengyel@sec.in.tum.de>
Acked-by: Tim Deegan <tim@xen.org>
Reviewed-by: Jan Beulich <jbeulich@suse.com>

Merge branch 'staging' of ssh://xenbits.xen.org/home/xen/git/xen into staging

xl: long output of "list" command now contains Dom0 information

As we've already generated a JSON config for Dom0, print that out when
requested.

Signed-off-by: Wei Liu <wei.liu2@citrix.com>
Acked-by: Ian Campbell <ian.campbell@citrix.com>

xl: use libxl_retrieve_domain_configuration and JSON format

Before this change, xl stores domain configuration in "xl" format, which
is in fact a verbatim copy of user supplied domain config.

Now libxl provides a new API to retrieve domain configuration, switch to
that new API, store configuration in JSON format.

Tests done so far (xl.{new,old} denotes xl with{,out} "libxl-json"
support):

1. xl.new create then xl.new save, hexdump saved file: domain config
saved in JSON format
2. xl.new create, xl.new save then xl.old restore: failed on
mandatory flag check
3. xl.new create, xl.new save then xl.new restore: succeeded
4. xl.old create, xl.old save then xl.new restore: succeeded
5. xl.new create then local migrate, receiving end xl.new: succeeded
6. xl.old create then local migrate, receiving end xl.new: succeeded

Note that "xl" config is still supported and handled when restarting a
domain. "xl" config file takes precedence over "libxl-json" in that
case, so that user who uses "config-update" to store new config file
won't have regression. All other scenarios (migration, domain listing
etc.) now use the new API.

Lastly, print out warning when users invoke "config-update" to
discourage them from using this command.

Signed-off-by: Wei Liu <wei.liu2@citrix.com>
Acked-by: Ian Campbell <ian.campbell@citrix.com>

libxl: introduce libxl_userdata_unlink

This will be used in later patch for xl to remove its "xl" userdata
file.

Both CTX lock and userdata lock are taken in this API. CTX lock is taken
to maintain locking hierarchy, but it also has a side effect to protect
against R-M-W by other threads. Userdata lock is used to protect against
domain destruction.

In general application should not rely on these internal locks to
protect its own userdata files. It should deploys its own lock if it
cares.

Signed-off-by: Wei Liu <wei.liu2@citrix.com>
Acked-by: Ian Campbell <ian.campbell@citrix.com>

libxl: introduce libxl_retrieve_domain_configuration

Introduce a new public API to return domain configuration. This returned
configuration can be used to rebuild a domain.

Note that this configuration only describes the configuration necessary
to reproduce the guest visible state and does not necessarily include
specific decisions made by the toolstack regarding its current
incarnation (e.g. disk backend) unless they were specified by the
application when the domain was created.

With this approach we can preserve what user has provided in the
original configuration as well as valuable information from xenstore.

Signed-off-by: Wei Liu <wei.liu2@citrix.com>
Acked-by: Ian Campbell <ian.campbell@citrix.com>

libxl: refactor libxl_get_memory_target

Introduce a helper function which can return both "target" node and
"static-max" node of a domain. Reimplement libxl_get_memory_target using
this helper. libxl__fill_dom0_memory_info is adjusted as well.

This helper will be used in later patch.

Signed-off-by: Wei Liu <wei.liu2@citrix.com>
Acked-by: Ian Campbell <ian.campbell@citrix.com>

libxl: make libxl_cd_insert "eject" + "insert"

We introduce an intermediate empty state when inserting media into
CDROM. The scheme works like this:

  lock json config
  write empty state to xenstore
  for (;;) {
      write user supplied disk to JSON
      write disk information to xenstore
  }
  unlock json config

Bear in mind that all steps can fail. With the proposed scheme, we now
know, if xenstore is empty, then CDROM should be considered empty;
otherwise we should use JSON version of CDROM configuration.

Signed-off-by: Wei Liu <wei.liu2@citrix.com>
Acked-by: Ian Campbell <ian.campbell@citrix.com>

libxl: synchronise configuration when we hotplug a device

We update JSON version first, then write to xenstore, so that we
maintain the following invariant: any device which is present in
xenstore has a corresponding entry in JSON.

The workflow is as followed:
   lock json config
       read json config
       update in-memory json config with new entry, replacing
         any stale entry
       for loop
           open xs transaction
           check device existence, abort if it exists
           write in-memory json config to disk
           commit xs transaction
       end for loop
   unlock json config

Please see comment in libxl_internal.h for correctness proof.

As those routines are called both during domain creation and device
hotplug, we add a flag to indicate whether we need to update JSON
config. This flag is only set to true when we hotplug a device. We
cannot update JSON config during domain creation as JSON config is
committed to disk only when domain creation finishes.

Signed-off-by: Wei Liu <wei.liu2@citrix.com>
Acked-by: Ian Campbell <ian.campbell@citrix.com>

libxl: rework domain userdata file lock

The lock introduced in d2cd9d4f ("libxl: functions to lock / unlock
libxl userdata store") has a bug that can leak the lock file when domain
destruction races with other functions that try to get hold of the lock.

There are several issues:
1. The lock is released too early with libxl__userdata_destroyall
   deletes everything in userdata store, including the lock file.
2. The check of domain existence is only done at the beginning of lock
   function, by the time the lock is acquired, the domain might have
   been gone already.

The effect of this two issues is we can run into such situation:

     Process 1                        Process 2 domain destruction
   # LOCK FUNCTION                 # LOCK FUNCTION
    check domain existence          check domain existence
                                    acquire lock (file created)
                                   # LOCK FUNCTION
                                    destroy all files (lock file deleted,
                                                       lock released)
    acquire lock (file created)
   # LOCK FUNCTION                  destroy domain
                                   # UNLOCK (close fd only)
   [ lock file leaked ]

Fix this problem by deploying following changes:

1. Unlink lock file in unlock function.
2. Modify libxl__userdata_destroyall to not delete domain-userdata-lock,
   so that the lock remains held until unlock function is called.
3. Check domain still exists when the lock is acquired, unlock if
   domain is already gone.

Signed-off-by: Wei Liu <wei.liu2@citrix.com>
Acked-by: Ian Campbell <ian.campbell@citrix.com>

VMX: don't unintentionally leave x2APIC MSR intercepts disabled

These should be re-enabled in particular when the virtualized APIC
transitions to HW-disabled state.

Signed-off-by: Jan Beulich <jbeulich@suse.com>
Acked-by: Kevin Tian <kevin.tian@intel.com>

x86: show page walk when create_bounce_frame() encounters a fault

... getting the native code in sync with the compat mode one.

Signed-off-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>

passthrough: streamline _hvm_dirq_assist()

The loop inside this function was calling two functions with loop-
invariable arguments which clearly don't need calling more than once:
send_guest_pirq() and __msi_pirq_eoi(). After moving these out of the
loop it further became apparent that folding the hvm_pci_msi_assert()
helper into the main function can further help readability.

In the course of this I noticed that __hvm_dpci_eoi() called
hvm_pci_intx_deassert() unconditionally, whereas hvm_pci_intx_assert()
(correctly) got called only when !hvm_domain_use_pirq(), so the former
is being made conditional now too.

Signed-off-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>